This report reads an existing R dataset (officially a “data frame”) and makes various descriptive summaries, packaging text, code, and output into an html file. Some of the graphics are rendered both as static and as interactive graphs. The interactive ones allow you to pan, zoom, and hover with the mouse to see more details. Interactive graphics are constructed using the R plotly package, which the Hmisc package loads automatically. To run this report you’ll need to install the Hmisc and plotly packages and their dependencies (RStudio will take care of the dependencies).
Loading User-Contributed R Packages
In R a package is a collection of functions, function documentation (“help files”), and example datasets. When you install R and RStudio many standard packages are automatically installed. There are over 10,000 user-contributed packages that extend what R can do. These include specialty packages for biomedical research (e.g., flow cytometry analysis, genomics, interfaces to REDCap) and a huge number of packages for analysis, table making, and graphics. After using the RStudio Tools menu to install add-on packages, you load packages into R with the require or library commands to make them available when running your report. In the following we get access to the Hmisc package.
Code
require(Hmisc)   # Hmisc must already be installed
# For certain Hmisc & rms functions set default output format to html
options(prType='html')
You can load this complete report script into your RStudio script editor window by running the command (usually from the R console under RStudio) getRs('stressecho.qmd', put='rstudio').
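As a minimal sketch of the install-then-load step described above, the following checks whether the packages this report needs are present before loading Hmisc. The helper name missing_pkgs is hypothetical and not part of Hmisc; requireNamespace, install.packages, and library are standard R functions.

```r
# Hypothetical helper (not part of Hmisc): return the names of packages
# that cannot be found in the current R library paths
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

need <- missing_pkgs(c("Hmisc", "plotly"))

if (length(need) > 0) {
  # Nothing is installed automatically here; the user decides what to install,
  # e.g. with install.packages(need) or the RStudio Tools menu
  message("Install these first: ", paste(need, collapse = ", "))
} else {
  library(Hmisc)   # library() stops with an error if the package is absent
}
```

require returns FALSE with a warning when a package is absent, whereas library stops with an error, so library tends to fail earlier and more visibly in a report.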
Accessing Data
There are many ways to import data into R, as described at hbiostat.org/rflow/fcreate.html. Much of the time you will be importing .csv or REDCap files or binary files created by Stata, SPSS, or SAS. For learning R there is a wide variety of ready-to-use datasets on the Vanderbilt Department of Biostatistics dataset repository at hbiostat.org/data. These datasets are already in R binary format and many of them are fully documented and annotated. In the Hmisc package there is a function getHdata that finds these datasets, downloads them from the web server, and loads them into R’s memory for easy access.
Use the getHdata function to download the stress echocardiogram dataset and read it into R’s memory. This dataset is annotated with variable labels and, for continuous variables, units of measurement. These annotations are used by contents, describe, and certain graphics functions.
Sometimes datasets have long names that are tedious to type. I use a convention of copying the currently active dataset into an R data frame called d to save typing during analysis.
Code
getHdata(stressEcho)
d <- stressEcho   # copy dataset so can refer to short name d
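For data that are not in the repository, a .csv import might look like the sketch below. The file is a temporary one written on the spot, and its column names and values are made up purely for illustration; read.csv, tempfile, writeLines, and str are standard R functions.

```r
# Write a tiny illustrative CSV file (made-up columns and values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,value",
             "1,a,10.5",
             "2,b,12.0",
             "3,a,9.8"), csv_path)

# Read it into a data frame; character columns become factors
d_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
str(d_csv)   # show variable names, types, and first values

# Following the short-name convention above one would then run:
# d <- d_csv
```

Unlike the repository datasets, a data frame read this way carries no variable labels or units, so those annotations would have to be added separately.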
Data Dictionary
The names of variables in the stress echo dataset can be printed by using the command names(d). But let’s use an Hmisc package function to provide more metadata.
Code
contents(d)
d Contents
Data frame: d
558 observations and 31 variables, maximum # NAs: 0
In the main output you’ll find that the number of levels of each categorical variable (R factor variable) is highlighted. These numbers are hyperlinks: if you click on one, the page jumps to the list of levels below.
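To give a feel for the kind of metadata involved, here is a rough base R sketch that tabulates class, number of factor levels, and number of NAs for each variable. It is only an approximation of the idea, not how contents is implemented; the data frame toy and its values are made up for illustration.

```r
# Made-up example data: one numeric and one factor variable
toy <- data.frame(
  age    = c(85, 73, 64, NA),
  smoker = factor(c("heavy", "non-smoker", "moderate", "non-smoker"))
)

# One row of metadata per variable
dict <- data.frame(
  variable  = names(toy),
  class     = vapply(toy, function(x) class(x)[1], character(1)),
  levels    = vapply(toy, nlevels, integer(1)),   # 0 for non-factors
  NAs       = vapply(toy, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL
)
dict

levels(toy$smoker)   # the list of levels that a level count refers to
```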
Descriptive Statistics
The Hmisc package describe function produces appropriate descriptive statistics for each variable in a data frame depending on the variable’s nature (binary, categorical, continuous, etc.). If you just type describe(d) you’ll get plain text output with no graphics. Running the output of describe through the html function converts it to html format which contains small graphics files to render high-resolution spike histograms for continuous variables.
See the glossary at hbiostat.org/R/glossary.html for definitions of some of the items in describe output.
Code
des <- describe(d)   # store results of describe() in des
des                  # can also use html(describe(d))
restwma: Resting wall motion abnormality on echocardiogram
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.745      257   0.4606   0.4978

posSE: Positive stress echocardiogram
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.553      136   0.2437   0.3693

newMI: New myocardial infarction
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.143       28  0.05018  0.09549

newPTCA: Recent angioplasty
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.138       27  0.04839  0.09226

newCABG: Recent bypass surgery
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.167       33  0.05914   0.1115

death
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.123       24  0.04301  0.08247

hxofHT: History of hypertension
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.625      393   0.7043   0.4173

hxofDM: History of diabetes
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.699      206   0.3692   0.4666

hxofCig: History of smoking
        n  missing distinct
      558        0        3
Value           heavy   moderate non-smoker
Frequency         122        138        298
Proportion      0.219      0.247      0.534

hxofMI: History of myocardial infarction
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.599      154    0.276   0.4004

hxofPTCA: History of angioplasty
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.204       41  0.07348   0.1364

hxofCABG: History of coronary artery bypass surgery
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.399       88   0.1577   0.2661

any.event: Death, newMI, newPTCA, or newCABG
        n  missing distinct     Info      Sum     Mean      Gmd
      558        0        2    0.402       89   0.1595   0.2686

ecg: Baseline electrocardiogram diagnosis
        n  missing distinct
      558        0        3
Value          normal  equivocal         MI
Frequency         311        176         71
Proportion      0.557      0.315      0.127
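Some of the columns above can be checked by hand. The sketch below rebuilds the restwma indicator from its printed counts (257 ones among 558 observations) and recomputes Mean, Gmd, and Info in base R. It assumes that Gmd is Gini's mean difference (the average absolute difference over all pairs of distinct observations) and that Info is one minus the sum of cubed relative frequencies, divided by one minus 1/n^2. These are my reading of the definitions, not code taken from Hmisc.

```r
# Rebuild the restwma indicator from the counts in the output above
x <- rep(c(1, 0), times = c(257, 558 - 257))
n <- length(x)

# Mean of a 0/1 variable is the proportion of ones
mean_x <- mean(x)

# Gini's mean difference: mean |x_i - x_j| over all pairs with i != j
gmd_x <- sum(abs(outer(x, x, "-"))) / (n * (n - 1))

# Info (assumed formula): (1 - sum of cubed relative frequencies) / (1 - 1/n^2)
p      <- as.numeric(table(x)) / n
info_x <- (1 - sum(p^3)) / (1 - 1 / n^2)

round(c(Mean = mean_x, Gmd = gmd_x, Info = info_x), 4)

# Proportions for a categorical variable (hxofCig) from its frequencies
freq <- c(heavy = 122, moderate = 138, "non-smoker" = 298)
round(freq / sum(freq), 3)
```

Under these assumptions the recomputed values agree with the restwma and hxofCig entries above to the printed precision.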
describe also has a plot method that translates descriptive statistics to purely graphical form and draws larger high-resolution histograms for continuous variables. The plot method produces two graphics objects: one for categorical variables and one for continuous variables. We save the overall plot result in object p, then reference its two sub-objects.¹
¹ For these static graphics we could have just typed plot(des) and two plots would appear in succession. That approach does not work when making interactive graphs later.
Code
p <- plot(des)
Code
p$Categorical
Figure 1: Categorical variables
Code
p$Continuous
Figure 2: Continuous variables
To render these two graphs in interactive format using plotly we set a system option that the Hmisc package monitors: grType. Its default value is 'plain'.
Code
options(grType='plotly')
p <- plot(des)
Code
p$Categorical
Figure 3: Categorical variables. Hover over points to see numerators and denominators of proportions, and hover over the leftmost point to see more information.
Code
p$Continuous
Figure 4: Continuous variables. Hover over the leftmost spike to see a full list of descriptive statistics. Hover over other spikes to see x-axis values and frequency counts.
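The grType switch used above relies on R's general options mechanism. The following base R sketch shows how an option can be set, read back, and restored. How Hmisc consults the option internally is not shown here; the fallback to 'plain' in the last line simply mirrors the default stated in the text.

```r
# options() returns the previous values invisibly, so they can be restored later
old <- options(grType = "plotly")

during <- getOption("grType")   # value seen by code that checks the option
during

# Restore whatever was set before (unset again if it was not set)
options(old)

# Read the option with a fallback for the case where it is unset
after <- getOption("grType", default = "plain")
after
```

Saving and restoring the old value keeps the interactive setting from leaking into later parts of a session that expect static graphics.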
Computing Environment
The Hmisc package stores a function named session inside its markupSpecs$html object; calling markupSpecs$html$session() prints the current computing environment in a readable format.
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Hmisc_5.1-1
To cite R in publications use:
R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
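Similar information is available without Hmisc. As a rough base R counterpart (not the Hmisc function itself), sessionInfo and citation report the R version, attached packages, and the recommended citation for R:

```r
si <- sessionInfo()              # from the utils package that ships with R

si$R.version$version.string     # the "R version ..." line
si$platform                     # platform string, e.g. the CPU/OS triple
names(si$otherPkgs)             # attached add-on packages, NULL if none

citation()                       # how to cite R in publications
```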
Report title: "Stress Echo Descriptive Analysis". Author: Frank Harrell.